ABSTRACT

We discuss the use of the Bayesian evidence ratio, or Bayes factor, for model selection in 
astronomy. We treat the evidence ratio as a statistic and investigate its distribution over an en- 
semble of experiments, considering both simple analytical examples and some more realistic 
cases, which require numerical simulation. We find that the evidence ratio is a noisy statistic, 
and thus it may not be sensible to decide to accept or reject a model based solely on whether 
the evidence ratio reaches some threshold value. The odds suggested by the evidence ratio 
bear no obvious relationship to the power or Type I error rate of a test based on the evidence 
ratio. The general performance of such tests is strongly affected by the signal to noise ratio in 
the data, the assumed priors, and the threshold in the evidence ratio that is taken as 'decisive'. 
The comprehensiveness of the model suite under consideration is also very important. The 
usefulness of the evidence ratio approach in a given problem can be assessed in advance of 
the experiment, using simple models and numerical approximations. In many cases, this ap- 
proach can be as informative as a much more costly full-scale Bayesian analysis of a complex 
problem. 

Key words: Statistics; Bayesian methods. 



1 INTRODUCTION 

The apparatus of Bayesian evidence has been proposed as the pre- 
ferred means of answering questions concerning model complexity 
in astronomy (e.g. Trotta 2008). Astronomers commonly wish to 
decide whether a given model fits a dataset adequately, or whether 
there is a need for additional degrees of freedom. Bayesian methods 
are attractive in this context because they expose any assumptions 
or prior information being used, and permit a clear statement of 
the questions that scientists actually ask of their data (e.g. Jaynes 
2003). They have been used extensively in the difficult problems of 
inference that arise in cosmology (e.g. Hobson et al. 2010), and also 
in the complex, multi-parameter modelling needed for the discov- 
ery of exoplanets (e.g. Gregory 2005). Astronomy is by no means 
the only area of science where these methodological questions are 
posed or where Bayesian methods are proposed as the solution; but 
the astronomical literature on the topic raises some issues worth 
treating in context. 

Bayesian methods give a transparent framework for model 
choice, in which it is necessary to define the set of competing 
models explicitly and exhaustively; Bayes' theorem then gives the 
probability of any particular model being correct. Integrating this 
Bayesian probability over the parameter space associated with the 
models (as detailed below in Section 2) then yields an overall ratio 
of odds for particular classes of models: the 'evidence ratio'. The 
requisite multi-dimensional integrations over the parameter spaces
of the models present a computational challenge, and astronomers
have made contributions to the development of these techniques. 
The refinement of Markov Chain Monte Carlo methods for the 
evaluation of multi-dimensional integrals is an example (Skilling 
2004; Mukherjee, Parkinson & Liddle 2006; Feroz, Hobson & 
Bridges 2009). Numerical Bayesian methods are establishing themselves as
a default approach, via public-domain packages such as CosmoMC
(http://cosmologist.info/cosmomc/) and MultiNest
(http://ccpforge.cse.rl.ac.uk/gf/project/multinest/).

Accepting the evidence ratio methodology, some authors have 
gone further and attempted to place a quality measure on future ex- 
periments according to the expected evidence ratio values that they 
are predicted to yield for given decision problems (Trotta 2007b; 
Heavens, Kitching & Verde 2007). What has been missing from 
this discussion, however, is an assessment of the statistical power 
of the evidence ratio: different realizations of data for a given ex- 
perimental configuration will yield different values of the evidence 
ratio, and we need to know how often the method will discrimi- 
nate correctly between models, and how often it will fail. This is a 
frequentist view of a Bayesian tool, but there is no conflict: the evi- 
dence ratio is a statistic generated from a dataset, so it is legitimate 
to ask how it will behave under repeated trials. 

We will discuss several examples where the evidence ratio can 
be used for model choice, and we will examine the statistical varia- 
tion that results from different realizations of the data. The variation 
is considerable, and we argue that this is likely to be generally true. 
This suggests caution in the use of evidence ratios, but it also suggests
that simplified methods can be used to compute evidence ratios and
check their robustness.

Notwithstanding these caveats, we do advocate a more 
widespread use of the evidence ratio technique in astronomy. 
Bayesian methods are currently usually employed on complex, 
high-value problems; but astronomers are also interested in simpler
model choice problems where the Bayesian techniques have
much to offer and are much easier to use (at least in an approx- 
imate way). It is feasible to experiment with these simpler cases 
and get a good sense of the robustness of the method. Approximate 
Bayesian methods may often be as good as is justified by the data. 



2 THE BAYESIAN EVIDENCE RATIO METHOD 

Suppose we have just two models $H_0$ and $H_1$, associated with sets
of parameters $\alpha$ and $\beta$. For data $D$, Bayes' theorem gives
the posterior probabilities of the models and their parameters:

$$P(H_0, \alpha \mid D) \propto P(D \mid H_0, \alpha)\, P(\alpha \mid H_0)\, P(H_0) \tag{1}$$

and

$$P(H_1, \beta \mid D) \propto P(D \mid H_1, \beta)\, P(\beta \mid H_1)\, P(H_1). \tag{2}$$

Here the priors are, for instance, $P(\alpha \mid H_0)$, the probability
distribution of the parameters given model $H_0$, multiplied by the prior
probability of the model class $H_0$ itself. We can often avoid the
(common) normalizing factor required in these equations. It di- 
vides out whenever we take the ratio to form relative probabilities 
or 'odds'. 

The restriction to two models is not fundamental. Often $H_0$ is
the 'null' or default hypothesis and is relatively simple and well
understood. It is vital that $H_1$ be reasonably comprehensive, covering
a range of possibilities, as otherwise the evidence ratio formalism 
may result in high odds in favour of one of the models when both 
are a poor fit. 

The term 'model' can commonly be applied to each distinct 
point in parameter space, but a distinct question is how reasonable 
a given class of model is in the face of some data. When we discuss 
'model selection', we are thus interested in the general viability 
of $H_0$ or $H_1$, irrespective of the exact value of their parameters.
Integrating out the parameters gives the posterior probabilities of
$H_0$ and $H_1$, conditional on the data. The ratio of these probabilities
is the posterior odds, $\mathcal{O}$:

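$$\mathcal{O} \equiv \frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{\int P(D \mid H_1, \beta)\, P(\beta \mid H_1)\, d\beta}{\int P(D \mid H_0, \alpha)\, P(\alpha \mid H_0)\, d\alpha}\, \frac{P(H_1)}{P(H_0)}. \tag{3}$$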

If we assume that our set of possible models is exhaustive, so that
$P(H_1 \mid D) + P(H_0 \mid D) = 1$, the probability of $H_0$ is

$$P(H_0 \mid D) = \frac{1}{1 + \mathcal{O}}. \tag{4}$$

For more than two models, this does not hold, but $\mathcal{O}$ always
gives the relative probabilities of any two models.

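The odds ratio $\mathcal{O}$ updates the prior odds on the models,
$P(H_1)/P(H_0)$, by a factor that depends on the data:

$$(\text{Posterior odds}) = (\text{evidence ratio}) \times (\text{prior odds}), \tag{5}$$

or

$$\mathcal{O} = \mathcal{E} \times \mathcal{PO}, \tag{6}$$

where $\mathcal{PO}$ denotes the prior odds and the evidence ratio is

$$\mathcal{E} = \frac{\int \mathcal{L}(\beta \mid H_1)\, P(\beta \mid H_1)\, d\beta}{\int \mathcal{L}(\alpha \mid H_0)\, P(\alpha \mid H_0)\, d\alpha}. \tag{7}$$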



The priors have to be properly normalized and may be quite different
for $H_0$ and $H_1$. If the models in question are also hard to
calculate, the computational problem is large.

A decision about which model to prefer thus requires both the 
evidence ratio and the prior ratio. The prior ratio is often taken as 
unity, but this is not always justified. For example, one might be 
reluctant to accept (say) $H_1$ with 100 free parameters if $H_0$ had
no free parameters. The evidence ratio contains a different penalty
for unnecessary complexity in the models: models are penalized if
only a small part of their prior parameter range matches the data. This
is often called the Ockham 'factor' (e.g. p348 of MacKay 2003),
although it is not usually an explicit multiplicative penalty based 
on the number of parameters. 

In this paper, we will always take the prior ratio to be unity, 
in the interests of brevity. This allows us in our examples to use 
'evidence ratio' and 'odds' interchangeably, the latter being often 
more illuminating. 
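
As a small numerical illustration of equations (4) to (6), the sketch below converts a log evidence ratio into posterior odds and into the posterior probability of the null hypothesis. It assumes the two-model, exhaustive case; the function names are ours and purely illustrative.

```python
import math

def posterior_odds(ln_evidence_ratio, prior_odds=1.0):
    """Posterior odds O = E x (prior odds), with E supplied as ln E."""
    return math.exp(ln_evidence_ratio) * prior_odds

def prob_h0(odds):
    """P(H0 | D) = 1 / (1 + O); valid only if H0 and H1 are exhaustive."""
    return 1.0 / (1.0 + odds)

# ln E = 5 with unit prior odds: odds on H1 of roughly 148 to 1.
odds = posterior_odds(5.0)
print(f"odds on H1 = {odds:.1f} : 1, P(H0 | D) = {prob_h0(odds):.4f}")
```

With a prior ratio of unity the odds and the evidence ratio coincide, which is why the two terms can be used interchangeably in the examples.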

The roles of the priors on the parameters, and the Ockham 
penalty, have been extensively discussed. Recent examples include 
Trotta (2008) and Niarchou, Jaffe & Pogosian (2004). In hard prob- 
lems, the prior and the likelihood can be of similar importance in 
determining the value of the integral, and their product may be 
multi-peaked or otherwise pathological. 

Many interesting cases are however much easier. The first ex- 
amples we will discuss can be solved analytically. More generally, 
if our data are informative, the likelihood function may be consider- 
ably narrower than the prior. The priors can then be approximated 
by constants over the relevant range of the parameters in the evi- 
dence integrals. Furthermore, in simple cases the integrand may be 
close to Gaussian around its peak, in which case the consequent 
integration of a multivariate Gaussian can be done analytically: 



$$\int \mathcal{L}(\alpha)\, P(\alpha)\, d\alpha \simeq \frac{(2\pi)^{m/2}}{|\det \mathcal{H}|^{1/2}}\, \mathcal{L}(\alpha^*)\, P(\alpha^*), \tag{8}$$

where $\alpha^*$ is the value at the peak of the likelihood, $\mathcal{H}$
is the Hessian matrix of second derivatives of the log of the likelihood
at the peak, and $m$ is the number of parameters. This equation is known
as the Laplace approximation, or the method of steepest descent (see e.g.
p341 of MacKay 2003). Both integrals in the definition of the evidence
ratio $\mathcal{E}$, which run over the likelihood function times the
priors on the parameters, are of this form.

The integration then reduces to the less laborious task of finding
the maximum posterior probability, and evaluating the matrix
$\mathcal{H}$. Averaging $\mathcal{H}$ over many realizations of the data yields the Fisher
matrix, which may be inverted to yield an approximate prediction 
for the covariance matrix of the parameters (e.g. Tegmark, Taylor 
& Heavens 1997). 

The Laplace approximation may not be valid, since the pos- 
terior may not be Gaussian near its peak, or there may be multiple 
peaks of similar height. The applicability of the approximation thus 
needs to be checked, at least via inspection of the posterior, or via 
comparison with an alternative robust means of integration, such as 
Monte Carlo. Monte Carlo methods can also be used to quantify
the robustness of the evidence ratio for different realizations of the 
data. In addition to providing possible indications of multimodality 
in the posterior, such an approach can also probe the stability of the 
evidence ratio against systematic error at plausible levels. 
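
The check described above can be sketched for a single parameter. The example below is a toy case of our own choosing, not one taken from the text: a Gaussian likelihood of width `s_like` peaking at `a_peak`, and a zero-mean Gaussian prior of width `s_prior`. It compares the Laplace estimate of equation (8) with a brute-force midpoint-rule integral; all names and numerical values are illustrative assumptions.

```python
import math

def integrand(a, a_peak, s_like, s_prior):
    """L(a) P(a): Gaussian likelihood (peak value 1) times a zero-mean Gaussian prior."""
    like = math.exp(-0.5 * ((a - a_peak) / s_like) ** 2)
    prior = math.exp(-0.5 * (a / s_prior) ** 2) / (math.sqrt(2.0 * math.pi) * s_prior)
    return like * prior

def evidence_numeric(a_peak, s_like, s_prior, steps=100_000):
    """Midpoint-rule integral of L(a) P(a) over a generously wide interval."""
    width = 12.0 * max(s_like, s_prior)
    lo, hi = min(0.0, a_peak) - width, max(0.0, a_peak) + width
    da = (hi - lo) / steps
    return da * sum(integrand(lo + (k + 0.5) * da, a_peak, s_like, s_prior)
                    for k in range(steps))

def evidence_laplace(a_peak, s_like, s_prior):
    """Equation (8) for m = 1, with |H| = 1 / s_like**2 at the likelihood peak."""
    hessian = 1.0 / s_like ** 2
    return math.sqrt(2.0 * math.pi / hessian) * integrand(a_peak, a_peak, s_like, s_prior)

# Narrow likelihood (informative data) versus a likelihood as broad as the prior scale.
for s_like in (0.1, 1.0):
    exact = evidence_numeric(0.5, s_like, 2.0)
    approx = evidence_laplace(0.5, s_like, 2.0)
    print(f"s_like = {s_like}: numeric = {exact:.5f}, Laplace = {approx:.5f}")
```

In this toy case the two estimates agree closely when the likelihood is much narrower than the prior, and drift apart as the widths become comparable, which is the regime where the prior can no longer be treated as constant.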



3 THE EVIDENCE AS A STATISTIC

3.1 Repeated experiments

Posterior odds on hypotheses can be obtained from a given dataset 
without ever having to consider whether the experiment could be 
repeated. This is the well-known advantage of Bayesian reasoning 
over the frequentist approach. Yet many kinds of experiment can 
be and are repeated many times: observations in astronomical sur- 
veys are an obvious example. In these circumstances, the Bayesian 
evidence is a statistic that can be computed from a given dataset, 
and it is hard not to wonder what value might have been obtained 
had our dataset been a different realization of the experimental pro- 
cess. Clearly if we know the likelihood function, which we must 
do to compute the evidence, then we must be able to generate other 
possible realizations of the data. 

To make a decision based on the posterior odds, a threshold is
set at some value of posterior odds, such as the 'decisive'
$\ln \mathcal{E} = 5$ value advocated by Jeffreys (1961). We may ask how often such a
strategy might lead us to make an incorrect decision. Conversely, 
it is useful to know if a given experimental setup is likely to yield 
data good enough to exceed the decision threshold and 'detect' the 
more complex model in cases where it is true. 

Suppose further that the evidence ratio turns out to be a 'noisy' 
statistic, in the sense that its distribution is very broad: in this case, 
there is little point in devoting excessive effort to computing the
evidence ratio very precisely. Given that practical computations can
involve difficult integrations over spaces of very high dimensional- 
ity, this is worth knowing. 

The notion of 'repeated trials' needs clarification. In the sim- 
plest case, the fluctuations in our data arise in the measurement pro- 
cess, while the object or process we are observing has fixed param- 
eters. A distinct case arises when we make repeated measurements 
of objects or processes that are different on each repetition. This 
often happens when we are observing samples and wish to make 
statements about properties of whole populations. In this case, ex- 
tra variance enters, often called cosmic variance. An elementary 
example is the distinction between repeated (noisy) measurements 
of the flux of a single galaxy, or a series of measurements where 
a different galaxy is observed on each occasion. In the latter case, 
we will have a prior distribution for the true flux of a randomly se- 
lected galaxy, and the data we obtain in a given measurement could 
be modelled by drawing a random number from this prior distribu- 
tion, and then adding noise. 

In dealing with the evidence ratio for repeated trials, we will 
thus use the prior twice. The standard Bayesian approach regards 
the data as being fixed specific numbers, and the prior enters only 
when we average the likelihood function over the prior to obtain 
the posterior probabilities. However, when we view the Bayesian 
outputs as statistics, we have to treat the data as random variables, 
whose distribution will depend on the values of the parameters for 
which we have a prior. The probability distribution of the evidence 
ratio involves the data, and so depends on the unknown parameters 
that are the argument of the prior. We can eliminate these parame- 
ters by a further integration over the prior, in effect, marginalizing 
the distribution of the evidence ratio to obtain its probability distri- 
bution independent of parameters. 
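
The two notions of 'repeated trials' can be made concrete with the galaxy-flux example. The sketch below assumes, purely for illustration, a Gaussian prior on the true flux and Gaussian measurement noise; the function name and all numerical values are hypothetical. It contrasts repeated measurements of one fixed object with measurements in which a new object is drawn from the prior each time.

```python
import random
import statistics

def simulate_measurements(n_trials, prior_mean, prior_sigma, noise_sigma,
                          new_object_each_trial, seed=0):
    """Draw noisy flux measurements for a fixed object or for a fresh object per trial."""
    rng = random.Random(seed)
    true_flux = rng.gauss(prior_mean, prior_sigma)  # single object, reused if fixed
    data = []
    for _ in range(n_trials):
        if new_object_each_trial:
            true_flux = rng.gauss(prior_mean, prior_sigma)  # new draw from the prior
        data.append(true_flux + rng.gauss(0.0, noise_sigma))  # add measurement noise
    return data

fixed = simulate_measurements(20000, 10.0, 2.0, 1.0, new_object_each_trial=False)
population = simulate_measurements(20000, 10.0, 2.0, 1.0, new_object_each_trial=True)
print("variance, fixed object:  ", statistics.variance(fixed))
print("variance, new object each:", statistics.variance(population))
```

In the second case the scatter of the data is the sum of the noise variance and the prior variance, which is the extra broadening that carries through to any statistic computed from the data, including the evidence ratio.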



3.2 Neyman-Pearson Analysis 

Suppose we have the posterior probabilities or posterior odds for
our competing models. These will vary with different realizations
of the data. What do we do with these probabilities or odds? This
is not a question that can be answered by probability theory but it
can be illuminated by it.

One approach is to set a threshold in the odds, effectively tak- 
ing one decision if our experiment gives posterior odds above the 
threshold, and another if they lie below. This general idea was in- 
troduced to classical statistics by Neyman and Pearson. A Bayesian 
approach is to emphasize the posterior probabilities or odds as a 
complete summary of our state of knowledge after the experiment, 
and to resist further interpretation. There are parallels here with the 
long history of controversy in classical statistics about the Neyman- 
Pearson approach versus Fisher's significance testing. Fisher rec- 
ognized the utility of the Neyman-Pearson method in industrial ac- 
ceptance testing but regarded it as too "wooden" to be useful in 
the ill-defined and creative processes of science (Fisher 1956). His 
detestation of Bayesian methods aside, Fisher would perhaps have
been sympathetic to the idea that posterior probabilities should be 
carried forward intact through the processes of science; he took 
much the same view of the results of his tests of significance. 

We believe that binary choice between alternatives is a rele- 
vant process in astronomy. The high cost and complexity of many 
astronomical research projects requires a difficult decision on when 
to commit to construction, which is irrevocable once made. More 
generally, there is the whole issue of how a community develops 
a consensus. The Bayesian ideal of a set of individuals each
interpreting the evidence ratio in their own way is hardly realistic:
rather, some pre-defined level of proof is needed - a threshold in
evidence ratio, in short. We will therefore apply a Neyman-Pearson
style of analysis to the evidence ratio, despite recognizing that this
is not a unique assessment of its utility.

Given the distribution of the evidence ratio $\mathcal{E}$ under two
competing hypotheses, we can ask how well the statistic performs. A
Neyman-Pearson analysis proceeds by defining a critical threshold
in the test statistic, say $\mathcal{E}_c$. If $\mathcal{E} < \mathcal{E}_c$,
we do not see any reason to reject the simpler null hypothesis $H_0$,
and it is accepted. If $\mathcal{E} > \mathcal{E}_c$, there is good
reason to prefer the more complex hypothesis $H_1$ and $H_0$ will be
rejected. A common 'decisive' choice for the critical threshold is
$\ln \mathcal{E}_c = 5$, corresponding to odds in favour of $H_1$ of
148:1 (Jeffreys 1961; Jaynes 2003). The common restriction to two
models is not critical, since we can always add the posterior
probabilities for $N$ alternative models and consider this to be a single
alternative. This yields sensible answers, even in the case where
all $N$ models fit the data about as well as $H_0$: if $N > 148$, we
would then decide that there was decisive evidence against $H_0$,
even though $H_0$ fitted as well as any model. This simply reflects
our assumption that all models are equally likely a priori.

In the Neyman-Pearson approach, there are two ways in which 
an incorrect conclusion might be reached: 

(1) Type I Error (False Positive). $H_0$ is true, but we are unlucky
enough to get a high value of $\mathcal{E}$ above $\mathcal{E}_c$, so
$H_0$ is incorrectly rejected. The Type I error rate is
$\alpha = P(\mathcal{E} > \mathcal{E}_c \mid H_0)$.

(2) Type II Error (False Negative). $H_1$ is true, but a value of
$\mathcal{E}$ below the threshold is found, so we fail to 'detect' the
need for a more complex model. The Type II error rate is
$P(\mathcal{E} < \mathcal{E}_c \mid H_1)$; the power is one minus this
rate, $\mathcal{P} = P(\mathcal{E} > \mathcal{E}_c \mid H_1)$.

The power of the test is defined as the probability that we will
correctly pick $H_1$ when it is true - i.e. it is unity minus the
probability of a Type II error. There is the usual trade-off: if we
conservatively use a high threshold, we reduce the chance of a Type I
error, but we also reduce the power of the test because we are
increasing the probability of a Type II error. The power is less often
discussed in astronomy, because alternative hypotheses are frequently
ill-defined. However, the notion of these alternatives is inherent in
the Bayesian method.
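
When the distribution of the evidence ratio is only available from simulation, the Type I error rate and the power can be estimated by simple counting. The sketch below assumes we already hold two lists of simulated $\ln \mathcal{E}$ values, one generated under each hypothesis; the function name and the toy numbers are ours.

```python
def error_rates(ln_e_under_h0, ln_e_under_h1, ln_e_crit):
    """Estimate (Type I error rate, power) for a threshold on ln E.

    ln_e_under_h0: simulated ln E values from data generated under H0.
    ln_e_under_h1: simulated ln E values from data generated under H1.
    """
    # Type I: H0 is true but ln E exceeds the threshold.
    alpha = sum(v > ln_e_crit for v in ln_e_under_h0) / len(ln_e_under_h0)
    # Type II: H1 is true but ln E fails to exceed the threshold.
    type_ii = sum(v <= ln_e_crit for v in ln_e_under_h1) / len(ln_e_under_h1)
    return alpha, 1.0 - type_ii

# Toy inputs, chosen only to show the bookkeeping.
alpha, power = error_rates([-1.0, 0.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0], ln_e_crit=1.0)
print(alpha, power)
```

Raising `ln_e_crit` lowers both returned numbers at once, which is the trade-off described above.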

An example might be a case where we calculate a statistic, say
chi-square, to evaluate the goodness of fit of a model $H_0$. From this
we may compute what is often called the p-value, the probability
of exceeding the value of chi-square we actually obtained, assuming
$H_0$ to be correct. This is classical significance testing. In the
Neyman-Pearson method, we would fix in advance a critical value
of chi-square, corresponding (say) to $p = 0.05$. If we exceed this
critical level (by any amount) we reject $H_0$; if we only have two
models, we then accept $H_1$.

The Neyman-Pearson binary decision rule avoids the paradoxical
issue of the relation between $p$ and $P(H_0 \mid D)$, which has been
pondered by a long series of authors (Lindley 1957; Jeffreys 1961
and others cited in Berger & Sellke 1987; Sellke, Bayarri & Berger
2001). Paradoxes arise because we might think that if we obtained
a small value of $p$, it would follow that $H_0$ was unlikely to be
correct. While this is not a rigorous interpretation of the meaning of
$p$, successive authors have calibrated $p$ for a range of models, and
found that $p < P(H_0 \mid D)$. This situation can be understood if we
select outcomes with given $p$ from an ensemble of repeated experiments
in which $H_0$ and $H_1$ are equally likely. At one extreme, the
two models may be rather similar, in which case an outcome with
any $p$ is equally probable on either model; if the models are
extremely different, $H_1$ might always yield $p \ll 1$, so observing e.g.
$p = 0.05$ could in fact provide strong reason to prefer $H_0$. Thus $H_0$
can be more likely than the value of $p$ might seem to suggest, and
a calibration of $p$ is required: without this, it is hard to know what
decisions to take on the basis of $p$. This discussion has continued
since Fisher introduced $p$ and the idea of significance testing.

The Neyman-Pearson approach, by contrast, is quite clear 
about how statistics should be used to take decisions and it is for 
this reason that we analyze the performance of the Bayesian ev- 
idence ratio using the concepts of Type I error rate and power. 
Our analytical models are the same as those discussed before
(Lindley 1957; Jeffreys 1961, for example) and we reproduce the
$p < P(H_0 \mid D)$ effect. But this is not our focus; our interest is in
showing how the Bayesian evidence ratio performs, as a basis for 
decision, under repeated trials. 

An advantage of the Neyman-Pearson approach is that be- 
cause there is a clear decision rule, the risks are also clear. For 
example, if there is a cost of some sort associated with wrongly 
rejecting the null hypothesis, then the threshold can be set to min- 
imize this. As we shall see, however, this also affects the power - 
and may affect the likely pay-off of correctly choosing the alterna- 
tive. 

The Neyman-Pearson lemma tells us that a statistic based on 
the likelihood ratio will be the best one to use as a basis for deci- 
sion. The evidence ratio is quite closely related to a likelihood ratio 
and so this is another reason for examining its performance from a 
Neyman-Pearson point of view. 

We note that the Neyman-Pearson approach can also be
assessed in a Bayesian way: we might ask, what is the probability
of (say) $H_1$, given that I have just obtained
$\mathcal{E} > \mathcal{E}_c$? This quantity,
$P(H_1 \mid \mathcal{E} > \mathcal{E}_c)$, is called the 'positive
predictive power' (PPP) in the medical literature. It is related by
Bayes' theorem to the Type I and Type II error rates, and the prior
odds ratio on $H_1$ and $H_0$:

$$\mathrm{PPP} = \frac{\mathcal{O}\,\mathcal{P}}{\alpha + \mathcal{O}\,\mathcal{P}}, \tag{9}$$

in which $\alpha$ is the false positive rate, $\mathcal{P}$ is the power,
and $\mathcal{O}$ is the prior odds on $H_1$. Evidently the use of the
PPP requires us to set a threshold in advance on our test statistic,
much as in the Neyman-Pearson approach. We might then pose similar
counterfactual questions such as, if we have obtained
$\mathcal{E} > \mathcal{E}_c$ and we then choose $H_1$
with probability PPP, what might our loss be if $H_0$ is true?
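
The positive predictive power follows directly from Bayes' theorem, PPP = (prior odds x power) / (Type I rate + prior odds x power), and is simple to evaluate. The sketch below uses hypothetical input values chosen only to show how strongly the result depends on the prior odds; the function name is ours.

```python
def positive_predictive_power(alpha, power, prior_odds=1.0):
    """P(H1 | E > E_c) from the Type I rate, the power and the prior odds on H1."""
    return prior_odds * power / (alpha + prior_odds * power)

# Same test (alpha = 0.05, power = 0.8), two different prior odds on H1.
for prior_odds in (1.0, 0.1):
    ppp = positive_predictive_power(0.05, 0.8, prior_odds)
    print(f"prior odds = {prior_odds}: PPP = {ppp:.3f}")
```

A test with a respectable Type I rate and power can still have a modest PPP when the alternative is unlikely a priori, which is why the prior odds cannot be ignored when interpreting a threshold crossing.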



4 GAUSSIAN EXAMPLES 

We will now examine two contrasting Gaussian examples. In the 
first, both priors are very narrow and the evidence ratio acts in the 
same way as a classical statistic for model choice. In the second 
we have one prior much wider than the other. The evidence ratio 
method now diverges strongly from a classical alternative and the 
role of the Ockham factor is apparent. In both cases we will see how 
finite amounts of data reduce the effectiveness of decisions based 
on thresholds in the evidence ratio. 

We have $N$ data values, $X_i$, which may have arisen via one of
two models:

(1) Independent drawings from a Gaussian of unit standard
deviation and mean zero.

(2) Independent drawings from a Gaussian of unit standard
deviation and mean $\mu$.

As usual, we assume that these two possibilities are a priori equally
probable. This is a case where the two models are nested: model 1
is a special case of model 2 ($\mu = 0$).

In astronomical terms, this situation might correspond to a 
(one-dimensional) Gaussian source in the zero-background limit 
(X-ray astronomy). The observed $N$ photon locations are $X_i$, and
we wish to see if there is evidence that the source is offset from 
some pre-determined position. As noted, there are various possibil- 
ities one may wish to test. One is that the source is supposed to be 
at one of two definite positions, both of which we know. Another is 
that the source is either at one definite position, which we know, or 
it is located 'somewhere else'. Here we put a Gaussian prior on the 
alternative position and so know the parameters of that prior (most 
interestingly, the spread). We now deal with these cases in turn. 



4.1 Example 1 

The models $H_0$ and $H_1$ hypothesize that the $N$ data $X_i$ are drawn
from unit Gaussians of mean zero and mean $\mu$ respectively. The
evidence ratio $\mathcal{E}$ is then very simple:

$$-2 \ln \mathcal{E} = \sum_{i=1}^{N} (X_i - \mu)^2 - \sum_{i=1}^{N} X_i^2 = N(\mu^2 - 2M\mu), \tag{10}$$

where $M$ is the mean of the $N$ samples. Clearly $M$ will be Gaussian,
of variance $1/N$, under either hypothesis, and so the logarithm
of the evidence ratio is also Gaussian (with variance $N\mu^2$). This
means that the evidence ratio will have considerable scatter, provided
$N\mu^2 \gg 1$.

Our decision procedure is to reject $H_0$ (and therefore accept
$H_1$) when $\mathcal{E}$ exceeds some threshold $\mathcal{E}_c$. Taking
$\mu > 0$, this occurs when $M$ exceeds

$$M_c = \frac{\mu}{2} + \frac{\ln \mathcal{E}_c}{N\mu}. \tag{11}$$

We make a Type I error (incorrectly rejecting $H_0$) when $M$ exceeds
this threshold and $H_0$ is true. The probability of this is the
probability that $M$ exceeds $M_c$ when $M$ is a Gaussian of mean zero and
variance $1/N$, which is

$$P(\mathrm{Type\ I}) = \frac{1}{2}\, \mathrm{erfc}\left( M_c \sqrt{N/2} \right). \tag{12}$$

We make a Type II error (incorrectly failing to reject $H_0$) when $M$
is less than $M_c$ and $H_1$ is true. The probability of this is

$$P(\mathrm{Type\ II}) = \frac{1}{2}\, \mathrm{erfc}\left[ (\mu - M_c) \sqrt{N/2} \right]. \tag{13}$$

Fig. 1 illustrates the possibilities. From these expressions we can
calculate curves of power versus Type I error level, in which $\mu$ and
$N$ are parameters. These are shown in Fig. 2.

Figure 1. The probability distributions of $\ln \mathcal{E}$ under $H_0$
(left) and $H_1$ (right), assuming $N = 4$ and $\mu = 1$. For the
arbitrary choice $\ln \mathcal{E}_c = 1$, the dark shaded area gives the
Type I error rate, and the light shaded area gives the Type II error rate.

Figure 2. Parametric curves of the power of the evidence ratio test for
example 1, as a function of the probability of Type I error, with the
parameter being the evidence ratio. Curves are shown for $\mu = 1$ and
$N = 2$ (lowest curve), with $N$ increasing for each curve by a factor 2.
The dots (green, blue, red) mark three fixed values of the evidence
ratio, each a power of $e$.

In a classical procedure we would base our acceptance of $H_0$
on whether $M$ differs significantly from zero. Specifically, we
reject $H_0$ if $M > t/\sqrt{N}$, where $t$ is a parameter analogous to
$\mathcal{E}_c$, which we choose to determine the trade-off between
power and Type I error. We find that this test is identical in form to
the one based on the evidence ratios (so the curves are the same in
Fig. 2). However, standard values of $\ln \mathcal{E}_c$ give very low
powers and very conservative Type I error rates, as the figure shows.
We can see why from the interesting relationship between the two
approaches:

$$\ln \mathcal{E}_c = \mu \left( \sqrt{N}\, t - \frac{N\mu}{2} \right). \tag{14}$$

Suppose we design an experiment to choose between $H_0$
($\mu = 0$) and $H_1$ ($\mu = 1$), based on the common decisive
threshold $\ln \mathcal{E}_c = 5$. Once we have obtained our data and
calculated the evidence ratio $\mathcal{E}$, we pick $H_1$ if
$\mathcal{E} > \mathcal{E}_c$ and expect the odds on this being the
correct choice to be 148 to 1. Consider for simplicity the case
$N = 1$: the critical $t$ for this example is then 5.5 and we
would need a 5.5-sigma result to reject $H_0$. This is a very
conservative procedure: the Type I error rate is $10^{-7.7}$ and the
power is a paltry $10^{-5.5}$. The analysis is warning us that setting
the critical odds at the apparently desirable 148 to 1 means we will
rarely exceed the evidence ratio threshold. As with any classical test
statistic, it makes no sense to set a critical value which will hardly
ever be exceeded for the amount of data available. This makes it
inevitable that $H_0$ will not be rejected. We need more data, or a less
stringent acceptance procedure, as Fig. 2 shows.

Suppose we carried out this experiment and obtained a single
datum that did indeed yield $t > 5.5$, $\ln \mathcal{E} > 5$. Formally we
should choose $H_1$ - but common sense tells us that our datum is next to
impossible under either $H_0$ or $H_1$. It would be more reasonable
to look for some missing hypothesis $H_2$, and certainly prudent to
check the goodness of fit of the apparently-favoured $H_1$. We will
make similar comparisons for our other examples.



4.1.1 ROC and AUC

The plot of power against Type I error rate is sometimes known
as the Receiver Operating Characteristic, or ROC, the name arising
from its origin in radar. It is used a good deal in medicine (e.g.
Zweig & Campbell 1993). A useful quantity could be the integral
under the ROC curve, known as the AUC (Area Under the Curve).
In our application, it is not difficult to show that the AUC is the
probability that the evidence ratio, assuming $H_1$, exceeds the
evidence ratio, assuming $H_0$. This condenses the ROC into a single
number, which for our examples is quite close to unity. This may
be a useful compression in some cases, but it does lose the
possibility of ascribing weight to the degree by which the evidence
ratios exceed each other.
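
The error rates for Example 1 can be evaluated directly, since the sample mean $M$ is Gaussian with variance $1/N$ under both hypotheses. The sketch below assumes $\mu > 0$ and uses the threshold on the sample mean $M_c = \mu/2 + \ln \mathcal{E}_c / (N\mu)$, which follows from equation (10); the function names are ours. It also evaluates the AUC as the probability that a value of $M$ drawn under $H_1$ exceeds one drawn under $H_0$.

```python
import math

def normal_tail(x):
    """P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def example1_rates(n, mu, ln_e_crit):
    """Return (t, Type I error rate, power) for the two fixed-mean hypotheses."""
    m_crit = 0.5 * mu + ln_e_crit / (n * mu)            # threshold on the sample mean
    t = m_crit * math.sqrt(n)                           # equivalent classical threshold
    alpha = normal_tail(t)                              # M ~ N(0, 1/N) under H0
    power = normal_tail((m_crit - mu) * math.sqrt(n))   # M ~ N(mu, 1/N) under H1
    return t, alpha, power

def example1_auc(n, mu):
    """AUC = P(M under H1 > M under H0); the difference is N(mu, 2/N)."""
    return 1.0 - normal_tail(mu * math.sqrt(n / 2.0))

t, alpha, power = example1_rates(n=1, mu=1.0, ln_e_crit=5.0)
print(f"t = {t:.1f}, log10(alpha) = {math.log10(alpha):.1f}, "
      f"log10(power) = {math.log10(power):.1f}")
print(f"AUC for N = 16, mu = 1: {example1_auc(16, 1.0):.4f}")
```

Scanning `ln_e_crit` at fixed `n` and `mu` traces out one ROC curve; increasing `n` moves the curve towards the top-left corner and the AUC towards unity.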



4.2 Example 2 

In this example, the null hypothesis $H_0$ remains that the random
variables $X_i$ are drawn from a Gaussian of mean zero and unit
variance. The alternative hypothesis $H_1$ is that the $X_i$ are drawn
from a Gaussian of mean $\mu$ and unit variance. However, in this case
we do not know $\mu$ - we assume that the prior on $\mu$ is Gaussian,
of mean zero and known standard deviation $\sigma$. This formulation
poses the question, 'is $\mu$ zero or is it non-zero but with a
restricted range of possibilities?' The models are nested because if
$\sigma = 0$ then $H_1$ reduces to $H_0$. Hence $H_1$ is a model that
requires an extra free parameter. A model similar to this was considered
by Jeffreys (1961), who examined the contrast between p-value and
posterior probability of $H_0$.

This example shows a standard technique for creating a
reasonably comprehensive alternative hypothesis - the use of a
hierarchical model (Gelman & Hill 2007). Here the introduction of one
extra parameter ($\sigma$) gives us a wide range of possible alternatives
for $\mu$, which in the previous example we had to set case-by-case.

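The evidence ratio, in the sense $H_1/H_0$, is now

$$\mathcal{E} = \frac{\int \exp\left[ -\sum_i (X_i - \mu)^2 / 2 \right] \exp\left( -\mu^2 / 2\sigma^2 \right) d\mu}{\sqrt{2\pi}\, \sigma\, \exp\left( -\sum_i X_i^2 / 2 \right)}. \tag{15}$$

The likelihood term depends on $\mu$ but not $\sigma$, which enters
through the prior on $\mu$. The ratio simplifies to

$$\mathcal{E} = \frac{1}{(NC)^{1/2} \sigma} \exp\left[ \frac{\left( \sum_i X_i \right)^2}{2NC} \right], \tag{16}$$

where

$$C = 1 + \frac{1}{N\sigma^2}, \tag{17}$$

which tends to unity as $\sigma$ becomes large. Evidently, the
distribution of $\mathcal{E}$ under repeated trials will be determined by
the distribution of $\sum_i X_i$, which will be Gaussian under either
$H_0$ or $H_1$.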
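
The closed form for Example 2, $\mathcal{E} = (NC)^{-1/2} \sigma^{-1} \exp[(\sum_i X_i)^2 / 2NC]$ with $C = 1 + 1/N\sigma^2$, can be checked against a direct numerical marginalization over $\mu$. The sketch below is such a check for one small, arbitrary data set; the function names, the data values and the integration settings are illustrative assumptions.

```python
import math

def evidence_closed_form(xs, sigma):
    """Evidence ratio H1/H0 from the closed-form expression."""
    n = len(xs)
    c = 1.0 + 1.0 / (n * sigma ** 2)
    total = sum(xs)
    return math.exp(total ** 2 / (2.0 * n * c)) / (math.sqrt(n * c) * sigma)

def evidence_by_quadrature(xs, sigma, steps=40_000):
    """Evidence ratio H1/H0 by midpoint-rule integration over the prior on mu."""
    sum_sq = sum(x * x for x in xs)
    half_width = 8.0 * sigma + max(abs(x) for x in xs) + 8.0
    d_mu = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        mu = -half_width + (k + 0.5) * d_mu
        # ln of the likelihood ratio L(mu) / L(0) for unit-variance Gaussian data
        log_like_ratio = -0.5 * (sum((x - mu) ** 2 for x in xs) - sum_sq)
        prior = math.exp(-0.5 * (mu / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
        total += math.exp(log_like_ratio) * prior
    return total * d_mu

data = [0.3, -1.2, 0.8, 1.5, 0.1]   # arbitrary illustrative values
print(evidence_closed_form(data, sigma=2.0), evidence_by_quadrature(data, sigma=2.0))
```

Because the data enter only through their sum, any two data sets with the same $N$ and the same $\sum_i X_i$ give the same evidence ratio.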
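To calculate this in detail, we define:

$$y \equiv C \ln \mathcal{E} + \frac{C}{2} \ln\left( NC\sigma^2 \right), \tag{18}$$

which becomes

$$y = \frac{\left( \sum_i X_i \right)^2}{2N}. \tag{19}$$

Under $H_0$, the sum is Gaussian of mean zero and variance $N$, so
that $2y$ (and hence $\ln \mathcal{E}$, up to a linear transformation) is
a $\chi^2$ variable with one degree of freedom. The density of $y$ is

$$dP/dy = (\pi y)^{-1/2} \exp(-y). \tag{20}$$

This immediately tells us that $\ln \mathcal{E}$ is distributed like a
$\chi^2$ variable and so $\mathcal{E}$ will have considerable scatter,
affecting the power of the test.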

The case of Type II error requires a little more thought. In this case, the $X_i$ are Gaussian with mean $\mu$, and so equation (16) can only give us the distribution of $\mathcal{E}$ conditional upon $\mu$, which we do not know. It is natural, however, to marginalize over the prior on $\mu$ to obtain an unconditional distribution for $\mathcal{E}$. This is an important conceptual step in the analysis, putting the prior spread in a parameter, $\mu$, on the same footing as spread in the data. The result is that the distribution of $\mathcal{E}$ is broadened beyond what would be the case if we considered repeated trials in which only measuring error (the distribution around fixed $\mu$) caused fluctuations in the result. We see no alternative to this conclusion: the existence of a prior on $\mu$ means that it must be treated as a random variable, whose value is undetermined before we perform an experiment. The larger the uncertainty in $\mu$, the larger the scatter in the values of $\mathcal{E}$ that we can obtain.
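The marginalization described above can be illustrated by direct simulation. The following is a minimal sketch, assuming unit-variance data and the closed form of equation (16); the function names, the choice $N = 8$, $\sigma = 1$ and the number of trials are illustrative assumptions, not values taken from the text. Under $H_1$ the mean is redrawn from its prior for every simulated experiment, so the resulting spread in $\ln\mathcal{E}$ includes both the prior spread and the measurement noise.

```python
import math
import random


def ln_evidence_ratio(xs, sigma):
    """ln of equation (16) for unit-variance data and a Gaussian prior of width sigma."""
    n = len(xs)
    c = 1.0 + 1.0 / (n * sigma ** 2)              # equation (17)
    s = sum(xs)
    return s * s / (2.0 * n * c) - 0.5 * math.log(n * c * sigma ** 2)


def simulate_ln_e(n, sigma, h1_true, trials, rng):
    """Return ln E for `trials` simulated experiments of n data points each.

    If h1_true, the mean mu is drawn afresh from its prior in every trial,
    which marginalizes the distribution of ln E over the prior on mu.
    """
    values = []
    for _ in range(trials):
        mu = rng.gauss(0.0, sigma) if h1_true else 0.0
        xs = [rng.gauss(mu, 1.0) for _ in range(n)]
        values.append(ln_evidence_ratio(xs, sigma))
    return values


def quartiles(values):
    """Lower quartile, median and upper quartile by simple rank selection."""
    v = sorted(values)
    return v[len(v) // 4], v[len(v) // 2], v[(3 * len(v)) // 4]


if __name__ == "__main__":
    rng = random.Random(1)
    for label, h1_true in (("H0 true", False), ("H1 true", True)):
        q1, med, q3 = quartiles(simulate_ln_e(8, 1.0, h1_true, 5000, rng))
        print(f"{label}: ln E quartiles {q1:.2f} {med:.2f} {q3:.2f}")
```

Comparing the interquartile ranges for the two cases gives a quick feel for how much of the scatter under $H_1$ comes from the prior rather than from the noise.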

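It follows that the distribution of $\mathcal{E}$ depends on the distribution of $\sum_i X_i$ marginalized over the prior. For $H_1$ we then find

$$dP/dz = (\pi z)^{-1/2} \exp(-z), \tag{21}$$

with $z = y/(1 + N\sigma^2)$. Returning to the Neyman-Pearson analysis, since $P(\mathcal{E})\,d\mathcal{E} = P(y)\,dy$ we can change to our convenient variable $y$ and integrate over Gaussians to obtain

$$P(\text{type I}) = \int_{\mathcal{E}_c}^{\infty} P(\mathcal{E} \mid \text{model 1})\, d\mathcal{E} = 1 - \mathrm{erf}\left(\sqrt{y_c}\right) \tag{22}$$

and

$$P(\text{type II}) = \int_{0}^{\mathcal{E}_c} P(\mathcal{E} \mid \text{model 2})\, d\mathcal{E} = \mathrm{erf}\left(\sqrt{z_c}\right), \tag{23}$$

where models 1 and 2 are $H_0$ and $H_1$ respectively, and $z_c = y_c/(1 + N\sigma^2)$. The threshold $y_c$ is related to our choice of critical $\mathcal{E}_c$ via equation (18).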

Fig. 3 shows curves for the power versus significance level as a function of $N\sigma^2$. This parameter expresses the dependence of test performance on the amount of data ($N$) and the prior degree of difference between the proposed models ($\sigma$). In this plot, it is remarkable where standard choices of critical evidence ratio lie. Take $\ln\mathcal{E}_c = 5$ for definiteness: at, say, $N\sigma^2 = 8$ we find the significance level to be $2 \times 10^{-4}$ and the corresponding power to be 0.19.

Figure 3. A plot of the power of the evidence ratio test for example 2, as a function of the probability of Type I error. Curves are shown for $N\sigma^2 = 2$ (lowest curve), increasing for each curve by a factor 2. The dots mark three fixed values of the evidence ratio (green, blue and red).

In words, this means that if we require the odds on $H_1$ to be 148 to 1 or stronger, then we will reject $H_0$ incorrectly only one time in 5000 trials, and we will pick $H_1$ when we should only one time in 5 trials. This is an excessively conservative decision procedure and parallels what we saw in the first example.
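The closed forms of equations (22) and (23) are easy to evaluate for any threshold. The sketch below assumes equation (18) in the form $y = C\ln\mathcal{E} + (C/2)\ln(NC\sigma^2)$, which is what equations (16) and (19) require, and uses the identity $NC\sigma^2 = N\sigma^2 + 1$. The function names are illustrative. For $\ln\mathcal{E}_c = 5$ and $N\sigma^2 = 8$ it gives a Type I error rate of order $2 \times 10^{-4}$ and a power of roughly 0.2, close to the figures quoted above.

```python
import math


def critical_y(ln_e_crit, n_sigma2):
    """Threshold y_c from equation (18), with N*C*sigma^2 = N*sigma^2 + 1."""
    c = 1.0 + 1.0 / n_sigma2                          # equation (17)
    return c * ln_e_crit + 0.5 * c * math.log(n_sigma2 + 1.0)


def type_i_error(ln_e_crit, n_sigma2):
    """Equation (22): chance that ln E exceeds the threshold when H0 is true."""
    y_c = max(critical_y(ln_e_crit, n_sigma2), 0.0)   # y is never negative
    return math.erfc(math.sqrt(y_c))


def power(ln_e_crit, n_sigma2):
    """One minus equation (23): chance of exceeding the threshold when H1 is true."""
    y_c = max(critical_y(ln_e_crit, n_sigma2), 0.0)
    return math.erfc(math.sqrt(y_c / (1.0 + n_sigma2)))


if __name__ == "__main__":
    for n_sigma2 in (2, 4, 8, 16, 32, 64):
        print(n_sigma2, type_i_error(5.0, n_sigma2), power(5.0, n_sigma2))
```

Scanning `ln_e_crit` at fixed `n_sigma2` traces out one curve of power against significance level, which is a cheap way to choose a threshold before any data are taken.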

Evidently, we will need much smaller critical odds than 148 to 1 to get reasonable performance from this test. To re-emphasize, this is a function of the chosen critical evidence ratio, not the form of the test, which is intuitive. The problem is that $\ln\mathcal{E}$ is noisy for small amounts of data and cannot sustain such decisive tests as are implied by $\ln\mathcal{E}_c = 5$ (for example). Fig. 4 illustrates this point.

This test might intuitively be derived without the evidence ratio, focusing on the test statistic $(\sum_i X_i)^2$, where the square enters to allow for the possibility that the actual, non-zero $\mu$ can be of either sign. For simplicity, consider the form $y$ we defined before, which is chi-square distributed (see equation 19). A simple test could be: reject $H_0$ if $y > y_c$, where the critical $y_c$ corresponds to some desired significance level or probability $p$ of Type I error. The power of this proposed test is conditional upon $\mu$. Marginalizing out $\mu$ with the Gaussian prior, as before, the forms of the power and significance level turn out to be identical to those based on the evidence ratio test. So the test is natural enough; the difficulty arises in choosing a sensible value for the critical evidence ratio. This is exactly the same difficulty that occurs in choosing the significance level in any classical test.

It might seem even more 'natural' for this problem to choose $\sum_i X_i^2$ as a test statistic. Working through the Neyman-Pearson analysis in this case is not possible analytically, as non-central chi-square distributions arise. However, a numerical analysis shows that this 'natural' procedure only performs better than the evidence ratio for small $N\sigma^2 < 3$. So the evidence ratio method is not trivially intuitive.

Figure 4. A plot of the power and significance levels of the evidence ratio test for example 2, as a function of the critical evidence ratio. Curves are shown for $N\sigma^2 = 2$ (lowest curve), increasing for each curve by a factor 2. Significance level is in blue, power in red. The changing slopes of the significance lines are real.

4.3 Conclusions from the Gaussian examples 

These simple examples show that the odds that we calculate from the evidence ratio may not be useful for making decisions if we take account of statistical variations over an ensemble of datasets. A decisive threshold of $\ln\mathcal{E} = 5$ can in many cases be exceeded only a small fraction of the time when $H_1$ is true, even when such a value is effectively impossible under $H_0$. In other words, the test is extremely safe (very hard to reject the null hypothesis incorrectly), but lacking in power (little ability to detect the alternative). This asymmetry between Type I and Type II performance seems undesirable, particularly because the problem is set up so that there are only two possibilities. If $H_0$ is clearly inconsistent with the data, then $H_1$ must be correct according to the problem as given, even if the Bayesian evidence ratio is only moderate. Again, this suggests that we should be free to challenge the statistical formulation and conclude that neither $H_0$ nor $H_1$ is correct. Fisher might have regarded this as the correct (less "wooden") approach.



5 A LINE FITTING EXAMPLE 

We now consider two more complex examples, where we are interested in which of two models is a better fit to spectral line data. We will use Monte Carlo simulation to assess the statistical scatter between different realizations of the data. We will consider two cases, one 'nested' (whether there is an extra component to a spectral line) and one not nested (whether a line has a Gaussian or Lorentzian profile).

Suppose we are trying to decide if a spectral feature is a single Gaussian (the null hypothesis $H_0$) or two Gaussians, of equal width, known separation, but unknown height ratio (the alternative hypothesis $H_1$). This is a nested model because if the height ratio is zero, $H_1$ reduces to $H_0$. The relevant parameters are the baseline; the height, width, and centre of the main line; and the height ratio for the subsidiary line. The models are:

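$$H_0:\quad y = \alpha_1 + \alpha_2 \exp\left(-\frac{(x-\alpha_4)^2}{2\alpha_3^2}\right) \tag{24}$$

$$H_1:\quad y = \beta_1 + \beta_2 \exp\left(-\frac{(x-\beta_4)^2}{2\beta_3^2}\right) + \beta_5 \exp\left(-\frac{(x-\beta_4-3\beta_3)^2}{2\beta_3^2}\right). \tag{25}$$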

The extra feature is located at a known distance of three standard deviations from the main one. We also treat the noise levels as free parameters to be determined from the data; this is realistic because we may not know the noise level very well. We again assume that each model is a priori equally likely, and that the noise is normally distributed. The models need priors on the parameters, which we describe later. In the Neyman-Pearson framework, our decision rule for this example will be: accept $H_1$ if the evidence ratio exceeds a critical value.

The Monte Carlo modelling process involves the following 
steps, some repeated. 

(1) To create the noise-free spectrum under $H_0$ we take $\alpha_1 = \beta_1 = 0$, $\alpha_2 = \beta_2 = 1$, $\alpha_3 = \beta_3 = 1$, $\alpha_4 = \beta_4 = 0$ in equation (24). The noise-free spectra are sampled on a pixel grid of spacing 1/5 in the above units.

(2) The noise-free spectrum under $H_1$ is created from equation (25) with the same parameters as for $H_0$, but the key parameter $\beta_5$, the strength of the satellite line, now enters. We will assume that the prior for $\beta_5$ is uniform between zero and 0.1, so we are looking for a satellite line that we expect to be at most 10% of the main line. The presence of this prior introduces a factor of 0.1 into the evidence ratio; we assume that the priors on the other parameters are the same for $H_0$ and $H_1$.

(3) To estimate the Type I error rate, we generate simulated data with $H_0$ assumed true, adding Gaussian noise. We define the signal-to-noise ratio as the ratio of the peak level (unity) to the standard deviation of the added noise. We fit the forms for $H_0$ and $H_1$ to these simulated data and compute the evidence ratio in the sense $\mathcal{E}$ = evidence for $H_1$ / evidence for $H_0$, using the Laplace approximation. This gives the evidence ratio when the data are created on the assumption that $H_0$ is true.

(4) To compute the power, we need to compute the evidence ratio using data generated for the case where $H_0$ is false. We will do this by assuming that $H_1$ is true. We generate simulated data under $H_1$ by adding noise to the noise-free spectrum under $H_1$. Choosing the random values of $\beta_5$, the satellite line strength, from its prior naturally marginalizes the distribution of the evidence ratio over the range of prior assumed strengths for the satellite line (the evidence ratio is a very strong function of this line height).

(5) We fit the forms for $H_0$ and $H_1$ again and compute the evidence ratio. Examples of the fits under $H_0$ and $H_1$ are shown in Fig. 5.

The use of the Laplace approximation is justified by examin- 
ing the likelihood functions and finding them to be close to Gaus- 
sian - a check that should always be made. 
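The Monte Carlo procedure in steps (1) to (5) can be sketched in a stripped-down form. This is not the Laplace-approximation fit described above: as a deliberate simplification, we assume that the baseline, height, width and centre of the main line and the noise level are all known, so that the satellite strength $\beta_5$ is the only free parameter and the evidence ratio reduces to a one-dimensional integral over its uniform prior. The grid range of $\pm 10$, the line centre at zero, and all function names are assumptions made for illustration.

```python
import math
import random

PRIOR_MAX = 0.1                                   # uniform prior on beta5 over [0, 0.1]
GRID = [-10.0 + 0.2 * i for i in range(101)]      # spacing 1/5; the +/-10 range is assumed


def gaussian(x, centre):
    """Unit-height, unit-width Gaussian line profile."""
    return math.exp(-0.5 * (x - centre) ** 2)


MAIN = [gaussian(x, 0.0) for x in GRID]           # main line, centre assumed at 0
SAT = [gaussian(x, 3.0) for x in GRID]            # satellite profile, three widths away
SAT_NORM = sum(s * s for s in SAT)


def ln_evidence_ratio(data, noise, steps=400):
    """ln[evidence(H1) / evidence(H0)] when beta5 is the only free parameter."""
    proj = sum((d - m) * s for d, m, s in zip(data, MAIN, SAT))
    # ln L(beta5) - ln L(0) on a midpoint grid spanning the prior range
    logs = []
    for k in range(steps):
        beta = (k + 0.5) * PRIOR_MAX / steps
        logs.append((beta * proj - 0.5 * beta * beta * SAT_NORM) / noise ** 2)
    top = max(logs)
    # prior-weighted mean of the likelihood ratio, evaluated stably in logs
    return top + math.log(sum(math.exp(v - top) for v in logs) / steps)


def simulate(snr, h1_true, trials, rng):
    """ln E for simulated spectra; beta5 is drawn from its prior when H1 is true."""
    noise = 1.0 / snr                             # the peak level is unity
    values = []
    for _ in range(trials):
        beta5 = rng.uniform(0.0, PRIOR_MAX) if h1_true else 0.0
        data = [m + beta5 * s + rng.gauss(0.0, noise) for m, s in zip(MAIN, SAT)]
        values.append(ln_evidence_ratio(data, noise))
    return values


def error_rates(snr, e_crit, trials=500, seed=0):
    """Monte Carlo Type I error rate and power for the rule: accept H1 if E > e_crit."""
    rng = random.Random(seed)
    ln_crit = math.log(e_crit)
    type_i = sum(v > ln_crit for v in simulate(snr, False, trials, rng)) / trials
    power = sum(v > ln_crit for v in simulate(snr, True, trials, rng)) / trials
    return type_i, power


if __name__ == "__main__":
    for snr in (15, 20, 25, 30, 35, 40):
        print(snr, error_rates(snr, 4.0))
```

Because the nuisance parameters are held fixed, this sketch will tend to look more favourable than a full fit; its purpose is only to show how the Type I error rate and the power are read off from two ensembles of simulated evidence ratios.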

The trends of the evidence ratio with signal-to-noise ratio are plotted in Fig. 6, which shows the median and the interquartile range for the log of the evidence ratio, plotted against the signal-to-noise ratio. We see that the evidence ratio or odds for $H_0$, if it is true, do not get very big compared to the odds for $H_1$. This is what we expect from a nested model, as $H_1$ can always do just as well as $H_0$, with only the Ockham penalty for extra complexity, which is not severe for only one extra parameter. It follows from our decision rule that the Type I error rate is quite low. Indeed, for the standard decisive ratio of $\mathcal{E}_c = 148$, the Type I error rates are exceedingly small: the decision rule is very conservative at these signal-to-noise ratios. On the other hand, clearly if $H_1$ is true we will often find values below the critical value and so the power is not large. Ultimately we can trace this to the width of the prior on the satellite line height; we are too vague about what we are looking for to have high power. This point arises again in the next example.

Figure 5. Both panels show fits of Gaussians, where the bimodal model $H_1$ is favoured at odds of 20 to 1. In the upper panel, the fit of the model under $H_1$ is shown; in the lower, under $H_0$.

The utility of the proposed decision rule is summarized in Fig. 7, which shows the power and Type I error rate as a function of decision threshold (the chosen critical evidence ratio) and signal-to-noise ratio. This diagram is specific to the problem at hand, but interesting points emerge. Evidently, the combination of the critical evidence ratio and the signal-to-noise ratio determines where the decision rule places one in the diagram. Standard decisive thresholds like $\mathcal{E}_c = 148$ result in low power and a very small Type I error rate, less than 1/500 with our number of repetitions of the Monte Carlo simulation. This may not be what is needed.

For comparison, we also apply a Bayesian Information Criterion (BIC; see e.g. Liddle 2007). In our case this means we pick the model with the smallest value of the normalized sum of squares plus the penalty term ln(number of data points) $\times$ (number of model parameters). The number of data points is the number of spectral channels; evidently this number is somewhat vague, as not all channels are equally informative.
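The BIC rule as stated above amounts to a one-line comparison. A minimal sketch follows, in which `chi2` stands for the normalized sum of squares of the best fit; the function names and the numbers in the usage example are illustrative assumptions, not results from our simulations.

```python
import math


def bic(chi2, n_params, n_data):
    """Normalized sum of squares plus the penalty ln(n_data) * n_params."""
    return chi2 + n_params * math.log(n_data)


def pick_by_bic(chi2_h0, k_h0, chi2_h1, k_h1, n_data):
    """Return the label of the model with the smaller BIC (ties go to H0)."""
    if bic(chi2_h0, k_h0, n_data) <= bic(chi2_h1, k_h1, n_data):
        return "H0"
    return "H1"


if __name__ == "__main__":
    # With 101 channels, one extra parameter must lower chi2 by more than
    # ln(101), about 4.6, before the more complex model is preferred.
    print(pick_by_bic(105.0, 4, 102.0, 5, 101))
    print(pick_by_bic(105.0, 4, 99.0, 5, 101))
```

The rule has no adjustable threshold, which is why it occupies a single point, rather than a curve, in a diagram of power against Type I error rate.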

The BIC rule, while offering no choices, sits in a useful place in the diagram for this relatively simple problem and is no worse in power than the evidence ratio. Finally, we note that different decision rules (for example, accepting $H_0$ if the evidence for it is bigger than the evidence for $H_1$) result in a different diagram.

For a second example, we consider trying to decide if a line profile is Gaussian ($H_0$) or Lorentzian ($H_1$). Here we have

$$H_0:\quad y = \alpha_1 + \alpha_2 \exp\left(-\frac{(x-\alpha_4)^2}{2\alpha_3^2}\right) \tag{26}$$

$$H_1:\quad y = \beta_1 + \frac{\beta_2}{1 + (x-\beta_4)^2/\beta_3^2}. \tag{27}$$

Figure 6. The evidence ratio for the first line-fitting example, in the sense evidence for $H_1$ / evidence for $H_0$, plotted against signal-to-noise ratio. These results include the effects of the priors that are described in the text. The red curve assumes $H_0$ is true (no satellite line) and the black curve assumes $H_1$ is true. The central curve is the median and the flanking solid lines are the 25th and 75th percentiles. 500 iterations were used at each noise level. The horizontal lines mark odds on $H_1$ that are even, 2:1, 4:1, and 8:1.

Figure 7. The power and Type I error rate for the first line-fitting example are shown for signal-to-noise ratios of 40 (topmost line), 35, 30... The critical ratio varies along each line, with points indicating a critical evidence ratio in favour of $H_1$ of 2 (red), 4 (blue) and 8 (green). Grey points are for the simple case of picking the model with the smaller BIC. At each combination of signal-to-noise ratio and $\beta_5$, 500 iterations were used. The Type I error rates are therefore not reliable near 1/500: the power should be zero if the Type I error rate is zero, but the curves are too steep near the origin to resolve properly with Monte Carlo.

The simulation proceeds very much as in the first case, except that we assume the priors are the same for the two models; this is justifiable since each of the parameters has a very similar effect in either model. We return to the priors later. The decision rule is: accept $H_1$ (the Lorentzian) if the evidence for it is bigger.

Fig. 8 shows examples of fits. Because the models are not nested, and there is no Ockham factor in play, the odds (at reasonable signal-to-noise ratios) are much stronger than in the previous example. Fig. 9 shows the trends of evidence ratio with signal-to-noise. There is much less spread in this case because it lacks the additional variability introduced into the previous example by the prior on the height of the satellite line.

Figure 8. This shows a case where a Lorentzian (lower panel) is favoured over a Gaussian (upper panel) at odds of 60:1. These impressive odds result from exponentiating small differences in the wings of the lines.

Fig. 10 shows the Type I error rates and powers in the same format as before. Performance is better (lower Type I error rate, and higher power, at the same signal-to-noise ratio), reflecting the fact that the two models are more distinct. Unlike the previous example, the BIC gives better power but worse Type I error rate.

Two points emerge that are more specific to the Bayesian context. One is the way we have formulated the decision rule, in terms of accepting $H_1$. In the double-line example, it is clear that $H_1$ is the more complex model, and classically we would probably have focused on whether or not we accepted $H_0$. The decision rule, 'accept $H_0$ if the evidence for it is bigger than for $H_1$', gives a different tradeoff between power and Type I error rate. The same is true for the second example, where this formulation (accepting the Gaussian, in that case) gives a much higher power, and a much worse Type I error rate.

The second point relates to the role of priors. There is no Ockham factor at work in the second example, but the odds that arise seem implausibly large, just looking at the fits. This happens because in the simulations we fit exactly the right model to the fake data. A related point is that our model space contains either $H_0$ or $H_1$ and nothing else. In reality it seems more likely that we would know $H_0$, the default, null, or starting hypothesis, fairly well, but the alternative might be rather vague. Again, this reiterates the lesson of the Gaussian examples, where we warned against adopting too restricted a set of models.

We can see the effects of this by a simple change to the Monte Carlo modelling. For the case $H_1$ true, we generate the fake data from the simple Lorentzian. We then fit a more complex model, a Lorentzian plus a quadratic baseline $ax + bx^2$. This corresponds to a case where Nature is simple, but we do not know this. Without accounting for priors, we find the odds on the Lorentzian model drop by five orders of magnitude. The factors accounting for the priors on $a$ and $b$ can recover much of this. For example, if we think in advance that $ax + bx^2$ should be less than the line height (1) over the range of the data ($\pm 10$) then we expect the prior range in $a$ to be $\sim 0.1$ and in $b$ to be $\sim 0.01$, which allows us to recover three orders of magnitude in the evidence ratio. The second panel of Fig. 10 incorporates the net loss of two orders of magnitude in the odds for $H_1$, and we see a large effect in the power. The Type I error rate remains the same because we have not changed $H_0$ in any way, reflecting its assumed role as the default, better-defined hypothesis. This part of the example shows how model uncertainty may play a large role in the quality of decisions made with a limited suite of options.

Figure 9. The evidence ratio, in the sense evidence for $H_0$ / evidence for $H_1$, for the second line-fitting example, plotted against signal-to-noise ratio. The red curve assumes $H_0$ is true. The central curve is the median and the flanking solid lines are the 25th and 75th percentiles. 500 iterations were used at each noise level. The horizontal lines mark odds on $H_1$ that are even and at three powers of $e$.



5.1 Conclusions from the Monte Carlo examples 

The direct simulations clarify some key aspects of the evidence ratio method. The role of the prior is apparent; uncertainty in our model parameters can range from sampling-noise dominated (spread in prior values $\gg$ spread due to measurement error) through to noise-dominated. The meta-problem of model uncertainty is also illustrated. A spurious restriction of the range of applicable models is punished by equally bogus levels of certainty. In this regard, simple goodness-of-fit criteria have much to offer as complements to formal procedures of model choice.

Treating the evidence ratio as a statistic shows that it has familiar features. Choosing a 'decisive' value, in the face of poor signal-to-noise ratio or similar models, results in conservative decision procedures with low power to pick alternatives. Really this is just a reprocessing of the lack of information in the data, but the evidence ratio encodes this fact in an obscure way. There is no obvious relationship between the posterior odds, the Type I error rate, or the power: signal-to-noise trumps all of these.

Figure 10. The power and Type I error rate for the second line-fitting example are shown for signal-to-noise ratios of 40 (topmost line), 35, 30... The critical ratio varies along each line, with points indicating three critical odds on $H_0$, each a power of $e$ (red, blue and green). Grey points are for the simple case of picking the model with the smaller BIC. The upper diagram has no quadratic baseline; the lower includes this as a possibility, as described in the text. Not all signal-to-noise ratios yield curves that lie in the plotted ranges of power and Type I error rate.



6 WHY DOES THE EVIDENCE RATIO SCATTER SO MUCH?

We have discussed at length a number of specific examples, which 
raise various questions concerning the evidence ratio methodol- 
ogy. The common feature has been the large dispersion in the ev- 
idence ratio, and the associated poor performance as a statistical 
tool for decision making. We now have to ask if our examples are 
just pathological cases, or if there is a general reason for the large 
scatter in the evidence ratio. 

Suppose we are dealing with Gaussian statistics and we are fitting functions $y = f(x, \alpha)$ and $y = g(x, \beta)$ to data $Y_i$, with known noise, at a set of points $x_i$. In the Laplace approximation, the statistics of the evidence ratio are largely dominated by the likelihood ratio

$$\mathcal{E} = \frac{\exp\left(-\frac{1}{2}\chi^2(\alpha^*)\right)}{\exp\left(-\frac{1}{2}\chi^2(\beta^*)\right)}, \tag{28}$$



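where the star superscript denotes the maximum likelihood value of the parameters. The logarithm of this is

$$\ln\mathcal{E} = -\frac{1}{2\sigma^2}\sum_i \left[Y_i - f(x_i,\alpha^*)\right]^2 + \frac{1}{2\sigma^2}\sum_i \left[Y_i - g(x_i,\beta^*)\right]^2. \tag{29}$$

Collecting terms, the contribution at each $i$ to the summation is, apart from the common factor $1/2\sigma^2$,

$$-\left[f(x_i,\alpha^*)^2 - g(x_i,\beta^*)^2\right] + 2\left[f(x_i,\alpha^*) - g(x_i,\beta^*)\right] Y_i. \tag{30}$$

If $f$ (for example) is the correct model, then $Y_i \simeq f(x_i,\alpha^*) + Z_i$, where $Z_i$ is a Gaussian variate by assumption. Introducing the distance $h$ between the models by $h(x_i, \alpha^*, \beta^*) = f(x_i,\alpha^*) - g(x_i,\beta^*)$, we see we have a sum of terms in which the stochastic component is of the form

$$\delta\ln\mathcal{E} = \sum_i h(x_i,\alpha^*,\beta^*)\, Z_i/\sigma^2, \tag{31}$$

where $\sigma$ denotes the rms noise.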

We therefore expect the logarithm of the evidence ratio (the log odds) to have a Gaussian distribution, with the width of this Gaussian determined by the detail of the distance function $h$. There are some special cases where this width might be small. One is where the 'incorrect' model $g$ is just the 'correct' model $f$, plus some term that is linear in the parameters. This of course would be true for any pair of linear models where $g$ was a more complicated version of $f$. In such a case, $h$ can be small. More generally, if $f$ and $g$ are nested models, with a parameter that is free to be determined in $f$ but fixed in $g$, then $h$ will also be small if the fixed value is close to the optimum.

In general, however, apart from special cases, we should expect that the log-odds will have a degree of variance for different realizations of the data. The extent of the variance will depend on the details of the models being compared but can be considerable for commonplace problems. Since the variations in $h$ that will be consistent with either model will naturally be $\sim\sigma$, this suggests a scatter in $\ln\mathcal{E}$ of order unity, as found in our examples.
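Since the $Z_i$ in equation (31) are independent with variance $\sigma^2$, the stochastic part of $\ln\mathcal{E}$ has rms $(\sum_i h_i^2)^{1/2}/\sigma$. The sketch below checks this against a direct simulation. It makes a strong simplifying assumption: both profiles are held at fixed parameters rather than being refitted to each noise realization, so $h$ is much larger than it would be after fitting and the printed scatter overstates what a real analysis would give. The profile shapes, grid and function names are illustrative.

```python
import math
import random


def gaussian_profile(x):
    """Unit-height, unit-width Gaussian line centred at zero."""
    return math.exp(-0.5 * x * x)


def lorentzian_profile(x):
    """Unit-height Lorentzian line centred at zero (unit width assumed)."""
    return 1.0 / (1.0 + x * x)


def predicted_scatter(f, g, xs, noise):
    """rms of equation (31): sqrt(sum of h^2) / noise, with h = f - g."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in xs)) / noise


def simulated_scatter(f, g, xs, noise, trials, rng):
    """Sample rms of sum_i h_i Z_i / noise^2 over independent noise realizations."""
    h = [f(x) - g(x) for x in xs]
    values = []
    for _ in range(trials):
        values.append(sum(hi * rng.gauss(0.0, noise) for hi in h) / noise ** 2)
    mean = sum(values) / trials
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (trials - 1))


if __name__ == "__main__":
    grid = [-10.0 + 0.2 * i for i in range(101)]
    rng = random.Random(2)
    print(predicted_scatter(gaussian_profile, lorentzian_profile, grid, 0.05))
    print(simulated_scatter(gaussian_profile, lorentzian_profile, grid, 0.05, 2000, rng))
```

Replacing the fixed profiles with best-fitting ones would shrink $h$ towards the noise level, which is the regime in which the scatter in $\ln\mathcal{E}$ is of order unity.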

Finally we note the similarity of this treatment to the classical 
likelihood ratio test. In this test, the logarithm of the ratio of max- 
imum likelihoods is found to be distributed like chi-square, if the 
models are nested. The relevance for us is that here we have a test 
statistic that is of very similar form to the evidence ratio, and which 
will inevitably show considerable scatter - although the exact de- 
gree of scatter must be calculated in individual cases, most simply 
by Monte Carlo simulation. 



7 SUMMARY AND CONCLUSIONS 

We have discussed the application of the Bayesian evidence ratio to some simple problems of model selection, which illustrate issues encountered in astronomical applications. We have treated the evidence ratio as a statistic that can be computed for a given dataset. Using analytical arguments and Monte Carlo simulation, we have investigated the distribution of evidence ratio values that results from an ensemble of experiments. Although the Bayesian approach treats the data as given, our data are but one sample of a range of possibilities. The evidence ratio calculation indeed assumes that we know the distribution of the data, so a calculation that incorporates this random element is always possible. All experiments could in principle be repeated (even the single datum of our Universe, if we accept the Landscape picture; e.g. Susskind 2003).

In the examples we have investigated, we find that the evidence ratio has a large dispersion about its ensemble average, and we have given arguments to suggest that such behaviour is likely to be general. We therefore suggest that one should compute not only the value of $\mathcal{E}$, but also its distribution.

Of course, at any one time, we have to work with the data that 
exist, and nothing that we write here should be taken as challeng- 
ing the fundamental Bayesian paradigm: modulo the often unjustly 
neglected priors on model classes, the evidence ratio does express 
our full knowledge of the relative odds on different models. But 
when one moves beyond this statement to a decision (the evidence 
is large enough, so I will build a factory to produce a new drug, or I 
will launch a new Great Observatory), then we need to know more 
about the evidence than a single numerical value. 

This point is of particular importance in calculations that attempt to predict the potential decisiveness of future experiments, as in e.g. Heavens et al. (2007) or Trotta (2007b). The expected evidence ratio for a given experiment, $\langle\mathcal{E}\rangle$, is an interesting quantity to know, but one may not want to choose an experimental strategy that maximises this quantity, if the price is an increased scatter.

The 'noisy' nature of the evidence ratio has a number of implications. At the practical level, there is a limit to how much effort it is worth investing in numerical algorithms for accurate computation of the evidence ratio: arguably, there is no point in evaluating $\mathcal{E}$ to better than a factor of 2 numerical precision. The Laplace approximation may be useful here, as may judicious use of approximate combinations of models (sometimes called 'toy' models, although this term can be an injustice).

Most seriously, the scatter in $\mathcal{E}$ means that it is inevitable that a good fraction of experiments will fail to find decisive evidence in favour of a more complex model, even when it is true (a large 'Type II' error rate, or low power). In such cases, do we accept that the simpler model is still an acceptable description of the data? The problem with this conclusion is that the performance of the evidence ratio when viewed as a statistic seems to be asymmetric between Type I and Type II errors. There may well exist levels of evidence that are far from being decisively in favour of a complex model ($\mathcal{E} \sim 1$), and are yet close to impossible on a simpler model. In such circumstances, it hardly seems sensible to persist with the simple model.

This reasoning is certainly reminiscent of the current controversy over whether a scale-invariant spectrum of cosmological mass fluctuations is ruled out (only moderate evidence in favour of tilted models, $\ln\mathcal{E} = 2.8$, according to Trotta 2007a, even though the observed deviation from $n = 1$ is '$3.3\sigma$'). It would be interesting to study this situation using the approach of Monte Carlo realizations that we have advocated here, to see what the evidence ratio is really telling us.

In such circumstances, where the evidence ratio fails to discriminate clearly between alternative models, and yet the available data are a poor match to the null model, we have to question whether the problem is correctly formulated. While the null hypothesis will normally be rather specific, the alternative can be vague, and difficult to reduce to a single model (a point made by Efstathiou 2008 in his critique of the evidence ratio approach). But if the null model gives a poor fit to the data to hand, we have good reason to believe that we need another model, even if the evidence ratio is unable to convince us that this must be the supplied alternative. The strength of the Bayesian approach is that it focuses our attention on the actual models we are considering. But without a goodness-of-fit test, the result may simply demonstrate that we lack the imagination to create a sufficiently exhaustive set of models.